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Abstract. The paper develops a technique for solving a linear equation Ax = 
b with a square and nonsingular matrix A, using a decentralized gradient al¬ 
gorithm. In the language of control theory, there are n agents, each storing 
at time t an n-vector, call it Xi(t), and a graphical structure associating with 
each agent a vertex of a fixed, undirected and connected but otherwise arbi¬ 
trary graph Q with vertex set and edge set V and S respectively. We provide 
differential equation update laws for the Xi with the property that each Xi con¬ 
verges to the solution of the linear equation exponentially fast. The equation 
for Xi includes additive terms weighting those xj for which vertices in Q cor¬ 
responding to the 2 -th and j-th. agents are adjacent. The results are extended 
to the case where A is not square but has full row rank, and bounds are given 
on the convergence rate. 


1. Introduction. Among the contributions of John Moore was a significant body 
of work, largely conducted with the last author of this paper, in which methods of 
control theory were applied to provide algorithms solving various problems of linear 
algebra, for example, matrix diagonalization. Typically, differential equations were 
constructed whose solutions evolved in time to a steady state containing the desired 
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result. For example, for a given square matrix, the equation might be initialized by 
the matrix and converge to a steady state solution that is a diagonal matrix with 
eigenvalues equal to those of the given matrix. Many of these results are set out in 
the book [1]. 

This paper offers another contribution along these lines: we show how to solve a 
linear equation Ax = 6 in a distributed way by a network of agents. Different from 
parallel algorithms in computer science [2-4] which usually require a common shared 
memory, a major novel feature of distributed algorithms lies in their implementation 
on multi-agent networks in which agents are provided with local memories and can 
communicate with each other. In the parallel world, the graph is chosen by the 
designer to maximize efficiency of computation in some sense whereas in ours it 
may be given by other considerations such as communication requirements. When 
a linear equation of interest involves millions or more unknowns, it is likely to 
be impossible for a single memory to even store the whole linear equation. Such 
large scale linear equations can easily arise in electromagnetic field problems, in 
which the boundary integral technique is employed to recover the solution of a 
differential equation in space by an appropriate integration of the solution on the 
two-dimensional boundary surface [5]. Another potential problem with the shared 
memory in parallel algorithms is the data safety. Security concerns often arise due 
to the fact that different agents are usually not necessarily in the same domain 
of trust. For example, when a customer turns to cluster computers for the help of 
computations which contain sensitive information such as business financial records, 
personally identifiable health information, etc, the customer may not be willing to 
share all such information with computers in the cluster [6]. One natural way to 
avoid these problems is by developing such distributed algorithms for multi-agent 
networks. Since agents may also be physically separated from each other, each 
agent typically is able to communicate only with certain other nearby agents. There 
are typically communication constraints on the information flow across multi-agent 
networks, which consequently preclude centralized processing and result in great 
interest of distributed algorithms. 

One direction for solving linear equations in a distributed way is by reformulat¬ 
ing them as a distributed optimization problem and then trying to employ existing 
algorithms in [7-10] to solve them. Rather than go through the intermediate step of 
problem reformulation, the authors of [11-13] have recently proposed a distributed 
algorithm for directly solving linear equations based on a so-called agreement prin¬ 
ciple, which was implicitly used in [14]. Here is the key idea: each agent limits the 
update of its state vector to satisfy its private equation, which is part of the linear 
equation Ax = 6; at the same time a control is developed to drive all agents’ states 
to reach a consensus vector, which means that this consensus vector must be the 
solution of Ax = b. Algorithms obtained along this direction were first shown to 
work for non-singular matrix A on fixed undirected networks in [11] and then were 
generalized to handle time-varying networks in [13]. 

In contract to the discrete-time algorithm proposed in [13], this paper proposes 
a continuous time algorithm and in so doing provides a different perspective on 
solving a linear equation in a distributed manner. The particular update rule for 
agent i’s state is 



jeAti 
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where with n = {1,2,..., n}, Mi denotes the set of labels of agent i’s neighbors’ 
from which agent i receives information, Xi is the state vector associated with agent 
i, and the particular form of the coefficients fij will be given subsequently. Now a 
feature of the algorithm is that it is, using the language of control theory, a form of 
consensus algorithm. This means that the different Xi{t) converge as t > oo to a 
common value, which will be the solution of the equation Ax = b. However what is 
probably the most distingishing feature of this work is its use of a basic result from 
differential geometry, which is perhaps not so well known in control theory, to give 
a more or less immediate proof of the paper’s main result. In particular we use the 
fact that a gradient flow algorithm associated with a real analytic function on a real 
analytic Riemannian manifold necessarily converges to a fixed point. This result is 
perhaps better known for the case of gradient flow algorithms in Euclidean space, 
see [15]; its more general form on Riemannian manifolds goes back to work of [16]. 

The structure of the paper is as follows. In the next section, we motivate the 
form of the differential equations and state the main result. The following section 
proves the main result, and Section 4 offers some comments on the convergence 
rate. Section 5 contains remarks relevant to future research. 


2. Establishing the consensus differential equations. Our starting point is 
the algebraic linear equation 


Ax = 

b 

(2) 

in which A G is non-singular and b G 

aJ denoting the i-th row of A as 

R’". 

We shall rewrite the equation with 

ajx = 

bi 

( 3 ) 

ajx = 


( 4 ) 



( 5 ) 

a^x = 


( 6 ) 


We shall later impose the requirement that A is nonsingular, but for the moment 
the requirement is not in force. 

The second rewriting of equations can be interpreted as stating that a; is a mem¬ 
ber of n different manifolds (here affine subspaces), viz ajx = 6^. Our approach 
to finding x relies on finding n vectors Xi (t) each of which belongs for all time to a 
particular manifold, in that 

ajxi{t)=bi, iGn (7) 


and so moving the Xi that they become more and more like each other. If and when 
they take a common value, call it x, it is obvious that it must satisfy the equation 
Ax = b (whether or not A is invertible). 

The basis for adjusting the Xi{t) involves setting up a cost function which pe¬ 
nalizes any differences between them. To this end, consider also a connected graph 
Q = (V,f) with n vertices, and each is associated with an n-vector Xi. Consider 
also the following cost function; 

V(xi,X2,...,X,i) = ^ ^ Wxi-XjW^ ( 8 ) 

Clearly V achieves a global minimum of zero if and only if all Xi have a common 
value. (Connectivity of the graph is essential to conclude the ‘only if’ statement). 
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Now let Mi denote the manifold ajXi = bi, and let M = Mi x M 2 x • • • x 

2 

Mn- Formally, one can regard M as the set of vectors X in R" obeying n scalar 
constraints of the form 

[0,0,...,a7,0,...,0]d:’ = 6,. 

(One should think of d:’ as a vector obtained by stacking the Xi). The function V is 
obviously defined on M" , but it is also defined on the affine subspace M C M” . We 
observe that V is clearly convex. This implies that the restriction V\M oiV to 
is convex too because the Mi are affine spaces, which shows a very useful property 
oi V\M. In the following proposition, we state two easily established properties 
linking the cost function to the linear equation. 


Proposition 1. Assume that the equation Ax = b is solvable. 

(i) The restricted cost function P : Ad —> R possesses a global minimum X £ M. 

It is unique if and only if Ax = b has a unique solution. 

(a) The local minima ofV\M coincide with the global minima, and these in turn 
coincide with the set of critical points ofV\M. 


The second part (ii) in the above Proposition is an immediate consequence of 
standard properties of smooth convex functions on vector spaces M. 

If we were to regard the function V as defined on and sought to compute 

a gradient flow from an arbitrary initial point in an attempt to reach the minimum, 
there would result 

Xi = - {xi - Xj), i £ n (9) 

j&Xi 

While this would result in a steady state in which all Xi had the same value, they 
would not be guaranteed to remain on the original manifold. To remedy this defect, 
we obtain the gradient of V on the manifold M. The way this is done in the opti¬ 
mization literature [1,17], given that the manifolds Mi and therefore the manifold 
M are embedded in a Euclidean space, is to simply project the gradient onto the 
tangent space of the manifold. Because AI = Mi x M2 x ••• x Mn, it is not hard 
to check that this is equivalent to projecting Xi onto the manifold Mi, Because the 
manifold Mi is an affine subspace, the tangent space is the set of vectors z £ R” 
for which a^z = 0. Now given an arbitrary y £ M", its projection onto the tangent 
space of the manifold Mi is simply Piy, where Pi denotes the orthogonal projection 
matrix to the kernel of aJ and for non-zero Oi one has 


P^ 


Q»aC~ 

aJ Oi 


( 10 ) 


where Pi denotes the projection matrix. More precisely, this means that replacing 
(9), we have for the gradient flow on the manifold A1 


Xi Pi ^ ] (xj ^j'), t £ ri. 
j&Xi 


( 11 ) 


Of course we also require that the trajectory begins on M, i.e. we require 

ajxi{0) = bi ( 12 ) 

It is easy to check using the differential equation that aJ Xi{t) = 0, which implies 
the trajectory stays on the manifold. 

And now we can state the main result: 
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Theorem 1. Consider the linear equation Ax = b with A G and b G R". 

Consider also a connected graph Q — (V,f) with n vertices, and let Mi denote the 
neighbor set of vertex i. If A is nonsingular, the equation set (11) with the initial 
condition (12) has the property that Xi(t) x for all i G n as t ^ oo. Moreover, 
convergence is exponentially fast. If A G for some m and has full row rank, 

convergence occurs to a solution of Ax = b. 

The proof of Theorem 1 will be given later. One might well imagine that in the 
course of solving the differential equation set, round-off or other errors could move 
the Xi{t) of the relevant manifold Mli. This can be accommodated by adding a 
further term to the equations which restores the trajectory towards the manifold. 
The equations remain linear in character. 

Corollary 1. Adopt the hypotheses of the theorem, save that (11) is replaced by 



(13) 


and let the initial condition now be free. Then the conclusions of the theorem con¬ 
tinue to hold. 

The proof of Corollary 1 will be given in the next section. Further generaliza¬ 
tion of (11) and (13) can be achieved by introducing scalar positive gain constants 
(varying with i) in the equations, thus (13) could be replaced, for arbitrary positive 
a and at, by 



(14) 


3. Proof of Main Results. We start this section with the proof of Theorem 1. 
By way of overview, we know from the general theory of real analytic gradient 
flows outlined in [16] that convergence must occur to an equilibrium point of the 
equations, and that because the manifolds Mi and thus the manifold M are closed, 
the equilibrium point must lie in M and thus the equilibrium values of the Xi 
must lie in each Mi. The equilibrium is necessarily a critical point of V. Now 
Proposition 1 (ii) indicates that the only critical points of V are the global minima 
of VjM. Under the hypothesis that A is square and nonsingular, or that it is 
has full row rank, there is an x such that Ax = b. An equilibrium point of the 
equations is given hy Xi = x for all i, which means that X G M assumes the form 
X = [x^,x^, ..., x^]^). Such a value of X gives rise to U = 0, i.e. corresponds to a 
global minimum. Conversely, at a global minimum, we can argue that X must take 
the form indicated, and then it follows that Ax = b. For suppose, in order to obtain 
a contradiction, that at a global minimum, there held X = [xj,xj ... ,x)^]^, and 
Xi ^ Xj for some pair ij. Because of the connectedness of the graph Q, there is a 
path of edges in £ connecting vertex i to vertex j. Since Xi ^ Xj, there must hold 
for some edge, call it rs, along this path that Xr ^ Xs, and then immediately V is 
seen to be nonzero since jjxr — is one of the summands making up V. Thus V 
does not attain its global minimum. Hence a necessary and sufficient condition for 
any equilibrium point to which the equations converge is that consensus is attained, 
i.e. Xi = Xj for all i,j and the common value is a solution of Ax = b. 

We now give a more explicit algebraic derivation of the same conclusion, which 
will be useful in pinning down the exponential rate of convergence. Let L denote 
the Laplacian matrix associated with Q] note that the connectivity property for Q 
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implies that L is nonnegative definite symmetric, with one eigenvalue at the origin, 
and the associated eigenspace is the span of 1, where 1 denotes the vector with 
entries all 1. Let denote the unit vector in K." with 1 in the i-th position. Denote 
the matrix formed by placing the Xi G K" next to one another as 


X= [xi X2...Xn] 


(15) 


(This should be distinguished from X, which is obtained by stacking the Xi). At 
the equilibrium, there holds 


- [7- = 0,i = 1,2, ...,n 

a a* 


(16) 


This implies that 

XLCi — XiCLi^ % — 1, 2,..., n (f'^) 

for some scalars Xi (which may be zero). In turn, with A = diag[Ai], there results 


XL = [ai 02 ... a„]A 

Since 1 is in the kernel of L, there results 


[Oi 02 ... o„] 


Ai 

A2 


= 0 


(18) 


(19) 


and because the vectors oi, 02 ,..., o„ are independent under the theorem hypothesis 
(whether or not A is square), we have that the Xi are all zero, that XL = 0 and 
therefore the columns of X are identical, i.e. consensus holds. 

Exponential convergence follows from the fact that the equations are linear and 
time-invariant. If convergence occurs, it is necessarily exponential. This completes 
the proof of Theorem 1, and we shall return to the question of the rate of convergence 
subsequently. 


To prove the corollary, we let = Xi — x* , where x* is a constant vector such 
that Ax* = b. From = Xi and (13) one has 

ii = -Pi {xi - Xj) - p—{ajxi - bi), i G n (20) 

which together with bi = aj x* implies 

Ci = -Pj {{xi - X*) - (xj - X*)) - Xi - ajx*) 

Thus 

Bi = —Pi ^ ^ (ci — ej) — {I — Pi)ei, i G n (21) 

jeXi 

To write the equations (21) in a more compact form, we let e = [ e[ ••• 

P = diag[Pi, P 2 , ■ • ■, Pn] and L = L ® In- One has 

e = -{PL + I-P)e (22) 


Proving that all Xi converge to x* exponentially fast is equivalent to proving that e 
converges to 0 exponentially fast, for which we need the following lemma: 


Lemma 1. 7/ker A = 0, all eigenvalues of PL + I — P are real and positive. 
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To prove Lemma 1 we need the following lemma about eigenvalues. 

Lemma 2. If M and N are square matrices of the same size, then MN and NM 
have the same eigenvalues. 

Proof of Lemma 2: By Sylvester’s Determinant Theorem [18], one has 

det(/ - \mN) = det(/ - \nM) (23) 

A A 

for A 0. Thus MN and NM share the same non-zero eigenvalues. Moreover, by 
det(MfV) = det(M) det(A^) = det(iVM), one has if 0 is an eigenvalue of MN or 
NM, it must be also the eigenvalue of the other. Thus MN and NM share the 
same eigenvalues if they are both square of the same size. I 

Proof of Lemma 1: Let us first establish that the eigenvalues of PL + I — P are 
identical with those of PLP + I — P. (We will then analyse the eigenvalues of the 
latter matrix) To see this, observe that because P^ = P, 

PL-P = P^L -P^ = P{PL - P) 

Hence the eigenvalues of PL — P are the same as those of {PL — P)P = PLP — P 
by Lemma 2. Consequently, the eigenvalues of PL -\-1 — P are the same as those 
of PLP -\-1 — P. Observe first that PLP + I — P is positive semi-definite since 
PLP and I — P are positive semi-definite. Thus to prove Lemma 1, it is sufficient 
to establish that PLP + I — P is non-singular. We will prove this in the following 
by assuming the contrary to obtain a contradiction. 

Suppose there is a nonzero a for which [PLP + {I — P)]a = 0. Then clearly 

LPa = 0 (24) 

(7 -P)a = 0 (25) 

It follows that La = 0 and so a = 1 (8> q for some q G R". Then again from the fact 
that (7 — P)a = 0, we obtain q = 0 for all i G n, whence q and thus a is zero. 
Then PLP + I — P is non-singular. ■ 

The fact that exponentially fast convergence occurs means that if b is varying, 
the algorithm will still lead to an approximate consensus solution of Ax = b, with 
the quality of the approximation linked to the rate of variation of b. 

4. Convergence rates. There is an alternative way of writing the equations (11) 
in the following compact form 

A" = -PLX (26) 

where X = [ xj xj ■ • ■ j""". Evidently, when the Oi are linearly indepen¬ 

dent, there are n eigenvalues of PL at the origin, and the remainder have negative 
real parts. The rate of exponential convergence of the system (26) to its equilibrium 
is determined by the smallest non-zero eigenvalue of PL which we shall denote by 

P- . . 

Note that PL and PLP have the same eigenvalues because PL = P{PL) and 
P{PL) and {PL)P have the same eigenvalues. Thus p is real and positive and is 
equal to the smallest non-zero eigenvalue of PLP. In order to find a bound for p, 
we will study non-zero eigenvalues of PLP. 
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Let Q = [ Q Q ] be a square orthogonal matrix such that the columns Q and 
Q form a basis for ker P and Im P, respectively. Then 

Q^PLPQ = 


The last equality (29) comes from PQ 
for Im P and P^ = P. 

Thus all non-zero eigenvalues of PLPQ are the same as those of the non¬ 
singular matrix LQ. From Q = In{n-i) ^md the Poincare Separation Theorem 
[19], one has 

Ai(Z) < Ai(QTLg) < a„+i(T) (30) 

where Aj(-) denote the jth smallest eigenvalue of a Hermitian matrix. Since L = 
L ® In and L is the Laplacian of a connected graph, one has 

Ai(Z)=0, A„+i(Z) = A2(L) (31) 

Recall that p is equal to the smallest non-zero eigenvalue of PLP, which is similar 
to Q^PLPQ. Then 

P = X^{Q^LQ) (32) 

From (30) to (32), one reaches a trivial lower bound for p, which is 0, and the 
following upper bound: 

Theorem 2. The smallest non-zero eigenvalue of PL is upper bounded by 

P < HL) (33) 


L[0 PQ ] 


_0 

Q^P 

0 _ 0 _ _ 

0 Q^PLPQ 

0 _ 0 _ _ 

0 Q^LQ 

= Q since the columns of Q forms a basis 


(27) 

(28) 
(29) 


Remark 1. \ 2 {L) is called the algebraic connectivity of a graph, 
below by [20] 


A2(L) > 


nD 


with D the diameter of the graph, and is bounded above by 

n 


A2(L) < 


n — 1 


It is bounded 
(34) 


forn > 2 with equality holding if and only if the graph is complete [21]. This further 
gives an upper bound of p in terms of the number of agents in the network. 

From ( 34 ) one observes that X 2 {L) could be very small, which suggests p may 
be close to 0 for certain graphs. Moreover, (32) implies that p is related to both 
L and Q, the latter of which is determined by the matrix A. It may be impossible 
to obtain a non-trivial lower bound for p without assuming more about A beyond 
non-singularity. 


5. Conclusions. There are a number of issues that are related to the ideas of this 
paper, but remain unexplored. We comment briefly on some of them. 

In case the matrix A is tall, in general the equation Ax = b cannot be solved, but 
it does make sense to search for a least squares solution. Theoretically, this could 
be obtained in the full rank case by working with the linear equation A^Ax = A^b, 
but any such approach requiring the initial computation of A^A and A^b might be 
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contrary to the spirit of seeking a decentralized solution because of the associated 
computations. For example, in distributed parameter estimation [22], a multi-agent 
network aims to solve a group of observation equations AiX = hi. Because the hi are 
usually contaminated with measurement noise, the whole linear equation Ax = b 
is usually overdetermined. How then to obtain a least squares solution is for the 
moment an open problem. An alternative idea to obtain the least square solution in 
a distributed way was briefly mentioned in [13] by solving a larger linear equation 
than Ax = b. However, this idea requires each agent to control an augmented state 
vector the dimension of which does not scale well with the number of agents in the 
network. This scaling problem has recently been partially ameliorated in [23] by 
assuming a sparse structure for the linear equations of interest. 

It would clearly be relevant to contemplate algorithms in which rows of A were 
grouped together. The manifolds Mi would be defined by equations like AiXi = bi 
where Ai was a fat matrix. The changes to the algorithm and the associated proof 
are trivial to contemplate. One issue is what the difference in computational burden 
might be. 

Evidently, the fact that the only requirement on the graph is that it be connected 
allows the overlaying of a concept like sparseness for each of the equations for Xi, 
that is the differential equation for Xi need only have at most one other Xj feeding 
into it. How to exploit sparseness in the matrix A is less clear. Obviously though, 
inner products involving ai are easier to compute when A is sparse. 

In the introduction, we noted that a discrete form of the algorithm is available, 
see [11-13]. It should be immediately derivable from the equations presented here. 
Whether there could be issues of stiffness that would cause problems in discretizing 
the equations is unknown. 

In numerical linear algebra, the difficulty of executing certain calculations is fre¬ 
quently evaluated, and often the formula involves the dimension of the underlying 
matrices. It is evident here there are tradeoffs possible between convergence rate, 
storage requirements and complexity requirements in an implementation of a dif¬ 
ferential equation solver. How all these interact and what sort of comparison can 
be made with conventional counts for matrix inversion is unknown. 

There are several other more significant directions in which this work might 
be developed. First, one could seek to add linear inequalities or more generally 
convex inequalities into the problem statement. Reference [14] can be thought of as 
explaining the use of consensus for distributed optimization in discrete time, and 
its existence is prima facie evidence of the reasonableness of seeking to generalize 
the ideas of this paper. Second, in coding theory there are sometimes requirements 
to solve linear equations with tall matrices whose entries are in a finite field. One 
might speculate as to whether a consensus approach could work, and be of benefit, 
for such problems. Of course, the notion of projection onto manifolds would not 
be expected to play any role. Last, we comment that the device of representing 
the equation to be solved as a consensus problem, and using a series of manifolds 
as a lens through which to view the problem, is a device that can find application 
to many control tasks. For example, consider the task of aligning three coordinate 
frames each in This is a consensus problem on a sphere, and gradient flow on a 
sphere is easy to compute. Thus one can readily devise an algorithm to secure the 
alignment. 
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